Abstract. In the context of clustering, we consider a generative model in a Euclidean ambient space with clusters
of different shapes, dimensions, sizes and densities. In an asymptotic setting where the number of points becomes
large, we obtain theoretical guarantees for a few emblematic methods based on pairwise distances: a simple algorithm
based on the extraction of connected components in a neighborhood graph; the spectral clustering method
of Ng, Jordan and Weiss; and hierarchical clustering with single linkage. The methods are shown to enjoy some 
near-optimal properties in terms of separation between clusters and robustness to outliers. The local scaling method 
of Zelnik-Manor and Perona is shown to lead to a near-optimal choice for the scale in the first two methods. We also 
provide a lower bound on the spectral gap to consistently choose the correct number of clusters in the spectral method. 
ponents, spectral methods, hierarchical clustering with single linkage, manifold learning, minimax rates, detection in
1 Introduction

In the context of clustering points in a Euclidean space, traditional methods, such as $K$-means or Gaussian
mixture models, assume that each cluster is generated by sampling points in the vicinity of a centroid. The
resulting clusters are ellipsoidal, and in particular full-dimensional. Several papers obtain theoretical results 
in this setting; see e.g., [43, 19, 46, 20, 1], and references therein. In a number of modern applications, 
however, the data may contain structures of mixed dimensions. Even the apparently simple case of affine
surfaces is a relevant model for a number of real-life situations [37]. Our focus here is a more general
framework making minimal assumptions on the underlying clusters. Note that our framework is inclusive of 
the classical setting. 

1.1 Mathematical framework 

We set the ambient space to be the $D$-dimensional unit hypercube $[0,1]^D$, though our results may generalize
to other settings such as Riemannian manifolds. In most of the paper, we assume that $D$ is fixed, and discuss
the case where $D$ is large in Section 5.3. For a positive integer $d \le D$ and a constant $\kappa > 1$, let $\mathcal{S}_d(\kappa)$ be the
class of measurable, connected sets (surfaces) $S \subset [0,1]^D$ such that

$$\forall x \in S: \quad \kappa^{-1} \varepsilon^d \le \mathrm{vol}_d(B(x, \varepsilon) \cap S) \le \kappa\, \varepsilon^d, \quad \forall \varepsilon \in [0, \mathrm{diam}(S)]. \qquad (1)$$

The condition above not only implies that the surface $S$ has (e.g. Hausdorff) dimension $d$ with finite $d$-volume,
it also prevents $S$ from being too narrow in some places. We also define $\mathcal{S}_0(\kappa)$ as the set of points
in $[0,1]^D$. We let $\mathcal{S}(\kappa) = \bigcup_{d=0}^{D} \mathcal{S}_d(\kappa)$.

For readers more familiar with function spaces, note that the class $\mathcal{S}_d(\kappa)$ contains for example the image
$f([0,1]^d)$ of locally bi-Lipschitz functions $f : [0,1]^d \to [0,1]^D$ satisfying:

$$M^{-1} \|b - a\| \le \|f(b) - f(a)\| \le M \|b - a\|, \quad \forall a, b \in [0,1]^d, \qquad (2)$$

for $M$ small enough.

For $S \subset [0,1]^D$ and $\tau > 0$, define

$$B(S, \tau) = \{x \in [0,1]^D : \min_{y \in S} \|x - y\| \le \tau\}.$$

This is the $\tau$-neighborhood of $S$ in $[0,1]^D$ relative to the Euclidean metric. Given surfaces $S_1, \dots, S_K \in \mathcal{S}(\kappa)$,
we generate clusters $X_1, \dots, X_K$ by sampling $N_k$ points in $B(S_k, \tau)$, the $\tau$-neighborhood of $S_k$, according
to a distribution $\Psi_k$ with density $\psi_k$ with respect to the uniform measure on $B(S_k, \tau)$. We call $\tau$ the noise
level or sampling imprecision. In the noiseless case, $\tau = 0$, the points are sampled exactly on the surface.
We require that $\kappa^{-1} \le \psi_k \le \kappa$, so the cluster is somewhat uniformly sampled. Our results apply without
major change for non-compactly supported sampling distributions with fast-decaying tails, such as Gaussian
noise. The classical setting corresponds to either $d = 0$ (centroids), or $d = D$ with $\tau = 0$ (full-dimensional
cluster). Let $N = \sum_k N_k$ be the total number of data points, which we denote by $x_1, \dots, x_N$. For later use,
define indices $I_k = \{i = 1, \dots, N : x_i \in X_k\}$.

We assume the clusters do not intersect, and in fact that the underlying surfaces are well-separated: 

$$\mathrm{dist}(S_k, S_\ell) := \min_{x \in S_k,\, y \in S_\ell} \|x - y\| \ge \delta, \quad \forall k \ne \ell. \qquad (3)$$

The actual clusters are therefore separated by a distance of at least $\delta - 2\tau$.

Surface clustering task. Given data $\{x_1, \dots, x_N\}$, recover the clusters $X_1, \dots, X_K$.
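
To make the generative model concrete, the following Python sketch draws two noisy clusters around parallel line segments (so $d = 1$, $D = 2$). It is only an illustration under stated assumptions: it uses the additive form $x = y + \tau z$ discussed in Section 5.3 rather than exact uniform sampling on the $\tau$-neighborhood, and the helper names `sample_near_segment` and `dist_to_segment`, the segment positions and the parameter values are all hypothetical choices.

```python
import math
import random


def sample_near_segment(n, a, b, tau, rng):
    """Draw n points x = y + tau * z, with y uniform on the segment [a, b]
    and z uniform in the unit disc (obtained by rejection sampling)."""
    points = []
    for _ in range(n):
        t = rng.random()
        y = (a[0] + t * (b[0] - a[0]), a[1] + t * (b[1] - a[1]))
        while True:  # uniform draw in the unit disc
            z = (rng.uniform(-1, 1), rng.uniform(-1, 1))
            if z[0] ** 2 + z[1] ** 2 <= 1:
                break
        points.append((y[0] + tau * z[0], y[1] + tau * z[1]))
    return points


def dist_to_segment(p, a, b):
    """Euclidean distance from the point p to the segment [a, b]."""
    ab = (b[0] - a[0], b[1] - a[1])
    t = ((p[0] - a[0]) * ab[0] + (p[1] - a[1]) * ab[1]) / (ab[0] ** 2 + ab[1] ** 2)
    t = min(1.0, max(0.0, t))
    return math.hypot(p[0] - a[0] - t * ab[0], p[1] - a[1] - t * ab[1])


rng = random.Random(0)
tau, delta = 0.02, 0.3  # illustrative noise level and surface separation
# Two parallel segments in the unit square, at distance delta from each other.
S1 = ((0.1, 0.3), (0.9, 0.3))
S2 = ((0.1, 0.3 + delta), (0.9, 0.3 + delta))
X1 = sample_near_segment(200, S1[0], S1[1], tau, rng)
X2 = sample_near_segment(200, S2[0], S2[1], tau, rng)
# Smallest distance between the two sampled clusters.
gap = min(math.dist(p, q) for p in X1 for q in X2)
```

By the triangle inequality, `gap` is never smaller than `delta - 2 * tau`, which is the separation between the actual clusters noted above.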

1.2 In this paper 

We first consider a simple algorithm based on extracting the connected components of an $\varepsilon$-neighborhood graph
built using a compactly supported kernel. We provide conditions guaranteeing that the algorithm perfectly
recovers the underlying clusters in our theoretical setting; this is done in Section 2.1. This approach may
be seen as a precursor to spectral methods, which extract 'soft' connected components based on an eigen-decomposition
of the Laplacian of the neighborhood graph. In Section 2.2, we consider the method introduced
by Ng, Jordan and Weiss [40], a standard spectral clustering algorithm. We show that, in our framework, the 
spectral method operates under very similar conditions as the method based on connected components. The 
last method we consider, in Section 2.3, is hierarchical clustering with single linkage, which is in some sense 
equivalent to the method based on connected components. Note that hierarchical clustering with average or
complete linkage is not suitable in our context, which includes elongated clusters.

It turns out that the first two methods are near-optimal in terms of separation between clusters and 
robustness to outliers. In Section 3, we show that, under low sampling noise, no method can perfectly separate 
clusters that are closer together than what the first two methods require by more than a poly-logarithmic 
factor. For clusters of dimension one or two, we obtain stronger results, showing that all clustering methods 
have in fact a non-negligible error rate in that same situation. In Section 4, we address the situation 
where outliers, points sampled elsewhere in space, may be present in the data, and show that the first two 
methods, properly modified, are able to accurately cluster within logarithmic factors of the best known 
detection rates [5, 3], even though the task of detection is a priori much easier than the task of clustering. 

In the discussion part of the paper, Section 5, we consider the choice of parameters, that is the scale 
defining the neighborhood graph and, for the spectral method, the number of eigenvectors to extract. We 
show that the local scaling method of Zelnik-Manor and Perona [51], with a number of nearest neighbors of
order slightly larger than $\log N$, leads to a near-optimal choice of scale. As a consequence, computations may
be restricted to small neighborhoods without compromising the clustering performance, so that a nearest-neighbor
search becomes the computational bottleneck. We also provide a bound on the eigengap that allows
the number of clusters to be estimated consistently. Finally, we discuss how the results generalize to the case
where the ambient dimension is very large or even infinite. 

The various proofs are gathered at the end of the paper, in Section 6, with the proofs of auxiliary results
gathered in Appendices A and B. The careful reader will notice that our results could be made non-asymptotic
without much change in the arguments. However, we chose to favor the statement of simple
results with concise proofs. 
1.3 Related work



Neighborhood graphs defined on a random set of points in Euclidean space are sometimes called random 
geometric graphs, and have been of interest in modeling networks. The book by Penrose [42] is a standard 
reference. The main difference in our case is that the support of the sampling density may be (close to) 
singular with respect to the Lebesgue measure. Extracting connected components from a neighborhood 
graph is a natural idea and has been proposed before; we comment on three publications that are particularly 
relevant to us [38, 10, 14]. Maier, Hein and von Luxburg [38] consider $k$-nearest neighbor type graphs and
analyze the performance of the resulting clustering algorithm within a slightly more restrictive model where
both the clusters and the sampling densities are smooth, and the degree of imprecision is positive, $\tau > 0$.
Within that framework, the results in that paper are non-asymptotic and more precise than our Theorem
1. Their emphasis is on choosing $k$ optimally in terms of maximizing the probability of correctly solving the
clustering task and on the effect of using different kinds of graphs. We comment on their work in more detail
in Section 5. In a similar model, Biau, Cadre and Pelletier [10] focus on estimating the correct number of
clusters based on counting the number of connected components in an $\varepsilon$-neighborhood graph. Both [38, 10]
consider the case where the space between clusters contains points; we call those points outliers and consider
this situation in Section 4.1. Brito, Chávez, Quiroz and Yukich [14] consider a model similar to ours with
all clusters full-dimensional. They also use a $k$-nearest neighbor graph and show that, when the separation
between clusters remains bounded away from zero, choosing $k$ of order $\log N$ makes the algorithm output the
perfect clustering; this is similar to our Proposition 3. They also consider a test of non-uniformity, where the 
alternative is that of points clustered more closely together as opposed to a cluster hidden in a background 
of uniform points as we consider in Section 4. However, there are no optimality considerations. In light 
of [38, 10, 14], our contribution is in considering a slightly more general framework, for which we provide 
short proofs, and in establishing optimality results in terms of separation between clusters and robustness 
to outliers. 

Spectral clustering methods have been specifically developed to work in the kind of framework we consider 
here [23]. Though these methods are very popular, few theoretical results are available on their performance
under this type of generative model. Ng, Jordan and Weiss, in their influential paper [40], introduce their
method and outline a strategy to analyze it; however, no explicit probabilistic model is considered. The same 
comment holds for [31]. In [47, 41], spectral clustering is taken to its empirical process limit as the number of 
points increases; though this provides insight on what spectral clustering is estimating, there is no result on 
its performance. This is similar to the analysis in [39]. Other papers, such as [24], introduce variations on the 
spectral method and provide theoretical results on computational aspects, not on clustering performance. 
Closer in spirit to the present paper is the work of Chen and Lerman [16], where the authors analyze a 
multi-way spectral method specifically designed for the case of affine surfaces. Our contribution here is in 
providing theoretical guarantees for spectral clustering methods in a rigorous mathematical framework. In
doing that, we provide a concise proof of the main result in [40] partly based on information that Andrew 
Ng shared with the author and the proof of [16, Th 4.5] by Chen and Lerman. 

To our knowledge, the minimax-type bounds on the separation between clusters obtained in Section 3 are 
the first of their kind in the context of clustering under a non-parametric model. In the classical setting, there 
is some existing literature, though very scarce; we will comment on a paper of Achlioptas and McSherry [1]. 
The literature is of course abundant in the context of estimation [50, 21, 32, 12] and classification [49, 44]. In 
our arguments, we use the popular approach consisting in reducing the task to a hypothesis testing problem. 

1.4 Additional notation 

Except for $D$, $K$ and $\kappa$, the parameters such as $d$, $\tau$ and $\varepsilon$ vary with $N$. This dependence is left implicit. An
event $E_N$ holds with high probability if $P(E_N) \to 1$ as $N \to \infty$. We use standard notation, such as: $a \vee b$
for $\max(a,b)$; $a \wedge b$ for $\min(a,b)$; $a_N \preceq b_N$ for $a_N = O(b_N)$; $a_N \asymp b_N$ for $a_N = O(b_N)$ and $b_N = O(a_N)$;
$a_N \ll b_N$ for $a_N = o(b_N)$.
2 Some standard clustering methods based on pairwise distances



We describe some common approaches to clustering, all based on pairwise distances. Each time, we provide
sufficient conditions for the method to output the perfect clustering. These conditions are seen to be necessary 
up to multiplicative logarithmic factors. We will see in later sections what these conditions imply in terms 
of comparative performance. 

The first two methods build a neighborhood graph on the data points using an affinity based on pairwise 
distances: 

$$a(z_1, z_2) = \begin{cases} \phi(\|z_1 - z_2\| / \varepsilon), & z_1 \ne z_2; \\ 0, & z_1 = z_2. \end{cases}$$

We assume the kernel $\phi$ is non-negative, continuous at $0$ with $\phi(0) = 1$, non-increasing on $[0, \infty)$ and is
fast-decaying, in the sense that $s^q \phi(s) = o(1)$ for any $q > 0$.



2.1 Clustering based on extracting connected components 

The first algorithm we introduce, Algorithm 1, extracts connected components of the neighborhood graph 
and therefore requires a compactly supported kernel; let $[0, \omega]$ be the support of $\phi$.



Algorithm 1 Pairwise clustering based on extracting connected components 
Input: 

$\{x_1, x_2, \dots, x_N\} \subset [0,1]^D$: the data set
$\varepsilon$: affinity scale
Output:

A partition of the data into disjoint clusters
Steps:

1: Compute the affinity matrix $W \in \mathbb{R}^{N \times N}$, with $W_{ij} = a(x_i, x_j)$.

2: Extract the connected components of W. 

3: Accordingly group the original points into disjoint clusters. 
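
As an illustration of Algorithm 1, here is a minimal Python sketch. It assumes the indicator kernel with support $[0, \omega]$, so that two points are linked exactly when their distance is at most $\omega\varepsilon$; any compactly supported kernel that is positive on that range yields the same graph. The function name `connected_component_clustering`, the union-find implementation and the toy data are hypothetical choices, and the pairwise loop is the brute-force variant.

```python
import math


def connected_component_clustering(points, eps, omega=1.0):
    """Label each point by its connected component in the graph that links
    i and j whenever ||x_i - x_j|| <= omega * eps."""
    n = len(points)
    parent = list(range(n))  # union-find forest over the point indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Steps 1 and 2: link every pair with positive affinity.
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= omega * eps:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    # Step 3: relabel the components 0, 1, 2, ... in order of first appearance.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(n)]


# Two horizontal rows of points, spacing 0.1 within a row, distance 1 between rows.
pts = [(0.1 * i, 0.0) for i in range(10)] + [(0.1 * i, 1.0) for i in range(10)]
labels = connected_component_clustering(pts, eps=0.15)
```

With `eps=0.15` the two rows come out as two components; with a scale below the within-row spacing, every point becomes its own component, which shows how sensitive the output is to the choice of scale.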



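Theorem 1. Consider the generative model of Section 1.1 with surfaces $S_1, \dots, S_K \in \mathcal{S}(\kappa)$. Assume that
$\delta - 2\tau > \omega\varepsilon$, with

$$\varepsilon \gg \max_{k = 1, \dots, K} \max\left\{ (\tau \vee \mathrm{diam}(S_k)) \left(\frac{\log N_k}{N_k}\right)^{1/d_k},\ (\tau \vee \mathrm{diam}(S_k))^{d_k/D}\, \tau^{1 - d_k/D} \left(\frac{\log N_k}{N_k}\right)^{1/D} \right\}. \qquad (5)$$

Then, Algorithm 1 is perfectly accurate with high probability.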

The proof of Theorem 1 is in Section 6.1. The condition $\delta - 2\tau > \omega\varepsilon$ means that distinct clusters are
separated by $\omega\varepsilon$ and therefore disjoint in the neighborhood graph. The term in brackets on the right-hand
side of (5) is actually the order of magnitude for the maximin distance between points sampled from $B(S_k, \tau)$.

Remark. In the classical setting where each $S_k$ is a centroid, i.e. $d_k = 0$, the algorithm is accurate when

$$\varepsilon \gg \tau \max_k \left(\log(N_k)/N_k\right)^{1/D}.$$



2.2 Spectral clustering 

When using kernels that are not compactly supported, extracting connected components makes little sense 
as the neighborhood graph is fully connected. Instead, spectral methods perform an eigen-decomposition of 
the graph Laplacian. The spectral method introduced in [40] uses the Gaussian kernel $\phi(s) = e^{-s^2}$. Note
that kernels of compact support are considered in [41] in the context of spectral clustering. We describe
the method of Ng, Jordan and Weiss [40] for a general kernel in Algorithm 2. The $K$-means algorithm is
initialized with centroids at nearly 90° angles, and then run with only one iteration. The initial centroids are 
chosen recursively, starting with any row vector of V and then choosing a row vector with largest minimal 
absolute angles with all the centroids previously chosen. 
Algorithm 2 Pairwise spectral clustering
Input:

$\{x_1, x_2, \dots, x_N\} \subset [0,1]^D$: the data set
$\varepsilon$: affinity scale
$K$: the number of clusters
Output:

A partition of the data into $K$ disjoint clusters
Steps:

1: Compute the affinity matrix $W \in \mathbb{R}^{N \times N}$, with $W_{ij} = a(x_i, x_j)$.

2: Compute the degree matrix $D = \mathrm{diag}(W \cdot \mathbf{1})$, and $Z = D^{-1/2} W D^{-1/2}$.

3: Extract $U = [u_1, \dots, u_K]$, orthogonal eigenvectors of $Z$ for its $K$ largest eigenvalues.

4: Renormalize each row of $U$ to have unit norm and let $V$ denote the resulting matrix.

5: Apply $K$-means to cluster the row vectors of $V$ in $\mathbb{R}^K$.

6: Accordingly group the original points into $K$ disjoint clusters.
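
The steps of Algorithm 2 may be sketched in Python with NumPy as follows. This is a rough illustration, not a reference implementation: the Gaussian kernel is taken as $e^{-s^2}$ (an assumption on the normalization), the affinity has zero diagonal as in the definition of $a$, and the $K$-means step is reduced to the recursive choice of nearly orthogonal rows followed by a single assignment pass. The function name `spectral_clustering` and the toy data are hypothetical.

```python
import numpy as np


def spectral_clustering(X, eps, K):
    """Sketch of Algorithm 2 with the kernel phi(s) = exp(-s^2)."""
    X = np.asarray(X, dtype=float)
    # Step 1: affinity matrix, with zero diagonal.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    W = np.exp(-((dists / eps) ** 2))
    np.fill_diagonal(W, 0.0)
    # Step 2: Z = D^{-1/2} W D^{-1/2}, with D the diagonal degree matrix.
    deg = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(deg)
    Z = W * inv_sqrt[:, None] * inv_sqrt[None, :]
    # Step 3: eigenvectors for the K largest eigenvalues (eigh sorts ascending).
    _, vecs = np.linalg.eigh(Z)
    U = vecs[:, -K:]
    # Step 4: renormalize each row to unit norm.
    V = U / np.linalg.norm(U, axis=1, keepdims=True)
    # Step 5a: start from the first row, then repeatedly add the row whose
    # largest absolute cosine with the chosen rows is smallest.
    centers = [0]
    for _ in range(K - 1):
        cos = np.abs(V @ V[centers].T).max(axis=1)
        centers.append(int(np.argmin(cos)))
    # Steps 5b and 6: one assignment pass; for unit vectors the nearest
    # centroid is the one with the largest inner product.
    return np.argmax(V @ V[centers].T, axis=1)


# Two horizontal rows of points, spacing 0.1 within a row, distance 1 between rows.
X = [(0.1 * i, 0.0) for i in range(10)] + [(0.1 * i, 1.0) for i in range(10)]
labels = spectral_clustering(X, eps=0.2, K=2)
```

On this toy example the rows of $V$ concentrate around two nearly orthogonal directions, one per row of points, so the single assignment pass is enough to separate them.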



Theorem 2. Consider the generative model described in Section 1.1 with surfaces $S_1, \dots, S_K \in \mathcal{S}(\kappa)$. Let
$\omega_N$ be such that $N^q \phi(\omega_N) = o(1)$ for any $q > 0$. Assume $\delta - 2\tau > \omega_N \varepsilon$ and that (5) holds. Then, Algorithm
2 is perfectly accurate with high probability.

The proof of Theorem 2 is in Section 6.2. We see that Theorem 2 is very similar to Theorem 1; for
example, with the Gaussian kernel, the separation condition is $\delta - 2\tau \gg \varepsilon \sqrt{\log N}$. The respective proofs are
essentially parallel as well, though for the latter we follow the outline provided in [40]. Thus in theory and
under our model, Algorithms 1 and 2 operate under similar conditions. In practice, however, it is well-known
that Algorithm 1 is substantially more sensitive to the specification of the scale parameter $\varepsilon$.

2.3 Single linkage clustering 

In the setting of Section 1.1, there is no hope for hierarchical clustering methods using complete or average 
linkage unless the clusters are separated by a distance comparable to their diameter, or larger. This is the 
classical setting, where the goal is typically to form clusters with small diameter [20]. On the other hand, 
the "chaining" property of hierarchical clustering with single linkage is desirable in our context, especially 
if the cluster is truly lower-dimensional (e.g. generated by sampling near a curve). In fact, if we stop the 
procedure whenever the closest distance between clusters exceeds $\varepsilon$, the resulting algorithm is equivalent to
Algorithm 1 with kernel $\phi(s) = \mathbf{1}\{s \le 1\}$. The procedure is described in Algorithm 3.



Algorithm 3 Single linkage clustering 
Input: 

$\{x_1, x_2, \dots, x_N\} \subset \mathbb{R}^D$: the data set
$\varepsilon$: maximum merging distance
Output:

A partition of the data into disjoint clusters
Steps:

0: Set each point to be a cluster.

1: Recursively merge the two closest clusters in terms of minimal distance.
2: Stop when the distance between any pair of clusters exceeds $\varepsilon$.
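
A direct, deliberately naive Python transcription of Algorithm 3 is sketched below. The function name `single_linkage` and the toy data are hypothetical, and the repeated scan over all pairs of clusters is chosen for readability rather than speed.

```python
import math


def single_linkage(points, eps):
    """Merge the two closest clusters (minimal pairwise distance) until the
    closest pair of clusters is more than eps apart."""
    clusters = [[i] for i in range(len(points))]  # Step 0: singletons

    def linkage(a, b):
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        # Step 1: find the closest pair of clusters (p < q).
        d, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p in range(len(clusters))
            for q in range(p + 1, len(clusters))
        )
        if d > eps:  # Step 2: stopping rule
            break
        clusters[p].extend(clusters.pop(q))

    labels = [0] * len(points)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = k
    return labels


# Two horizontal rows of points, spacing 0.1 within a row, distance 1 between rows.
pts = [(0.1 * i, 0.0) for i in range(6)] + [(0.1 * i, 1.0) for i in range(6)]
labels = single_linkage(pts, eps=0.15)
```

On this example the output coincides with the connected components of the graph that links points at distance at most `eps`, in line with the equivalence with Algorithm 1 noted above.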



Corollary 3. Under the conditions of Theorem 1, Algorithm 3 is perfectly accurate with high probability. 

We mention the paper of Achlioptas and McSherry [1], which introduces an algorithm based on a combination
of spectral clustering and single linkage clustering. Their analysis shows that their algorithm performs
comparatively well in the classical setting. 
3 Optimality in terms of separation between clusters



From Theorem 1 we see that Algorithm 1 is able to correctly identify clusters separated by a distance of
the order of the term on the right-hand side of (5). In the classical setting, Algorithm 1 is accurate when
$\delta > C\tau$, with $C > 2$; this is valid in any dimension, as explained in Section 5.3. The requirement is therefore
comparable to, and actually weaker than, the lower bound achieved in [1, Th. 6]. This is assuming we can select
an appropriate scale, which we do in Section 5.1. Note that the algorithm of Achlioptas and McSherry [1] 
requires selecting the correct number of clusters. 

In our framework, the degree of separation required by Algorithm 1 to be perfectly accurate is close to 
optimal when the noise level $\tau$ is small, specifically,

$$\tau \preceq \min_{k = 1, \dots, K} \mathrm{diam}(S_k) \left(\log(N_k)/N_k\right)^{1/d_k}.$$

Theorem 4. For any clustering method and any probability $p \in (0,1)$, there are surfaces $S_1, S_2 \in \mathcal{S}_d(\kappa)$ of
diameter at least 1/2 and separated by $\delta$, with $\delta - 2\tau \succeq (1/N)^{1/d}$, such that, in the context of the generative
model of Section 1.1, the method makes at least one mistake with probability at least $p$.

The proof of Theorem 4 is in Section 6.3. 

Remark. We avoided the case of surfaces of mixed dimensions since the use of more sophisticated tools,
such as local density or dimension estimation [28, 35], could possibly narrow the separation. 

The conclusion of Theorem 4 is rather weak, though, as it does not give conditions under which any 
clustering method has a substantial error rate (in terms of labeling the points). In dimensions one and two, 
we are able to prove such a result. In fact, we show that Algorithm 1 achieves the optimal separation rate, 
up to a constant factor in dimension one and up to a poly-logarithmic factor in dimension two. We were not 
able to prove such a result in higher dimensions. 

Theorem 5. For any clustering method, there are surfaces $S_1, S_2 \in \mathcal{S}_1(\kappa)$ of diameter at least 1/8 and
separated by $\delta$, with $\delta - 2\tau \succeq \log(N)/N$, on which the method has an error rate exceeding 1/9 with high
probability. 

The proof of Theorem 5 is in Section 6.4. 

Theorem 6. For any clustering method, there are surfaces $S_1, S_2 \in \mathcal{S}_2(\kappa)$ of diameter at least 1/8 and
separated by $\delta$, with $\delta - 2\tau \succeq 1/(\sqrt{N} \log(N) \sqrt{\log\log(N)})$, on which the method has an error rate exceeding
1/9 with high probability. 

The proof of Theorem 6 is in Section 6.5. 

4 Optimality in terms of robustness 
4.1 Dealing with outliers 

So far we only considered the case where the data is devoid of outliers. We now assume that some outliers 
may be included in the data. The outliers are sampled from a distribution $\Psi_0$ with density $\psi_0$ with respect
to the uniform measure on $[0,1]^D \setminus \bigcup_k B(S_k, \delta)$, again with $\kappa^{-1} \le \psi_0 \le \kappa$. We assume this region has
$D$-volume bounded below by a positive constant. We denote by $N_0$ the number of outliers. We highlight the fact that
outliers are away from surfaces by at least $\delta$, the same lower bound on the distance that separates two
distinct surfaces.

The algorithms considered here are based on pairwise distances, so we need to assume that the outliers 
are not as densely sampled as the actual clusters, for otherwise they will be indistinguishable from non-outlier 
points. 

Proposition 1. Assume the conditions of Theorem 1 hold, now in a setting that includes outliers, and,
in addition, that $\varepsilon \ll N^{-2/D}$. Then, Algorithm 3 is perfectly accurate with high probability if, when the
algorithm stops, singletons are labeled as outliers. 






The proof of Proposition 1 is in Section 6.6. 

Algorithms 1 and 2 need to be modified in order to deal with outliers. We introduce an additional 
step which consists in discarding the data points with low connectivity in the neighborhood graph. This 
approach to removing outliers is very natural and was proposed in other works, such as [17, 38]. Specifically, 
fix a sequence $\omega_N \to \infty$ such that $\omega_N \ll \log N$; then, between steps 1 and 2, compute the degree matrix
$D = \mathrm{diag}(W \cdot \mathbf{1})$ and discard the points with degree $D_i < \omega_N N \varepsilon^D + \log N$.
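
The thresholding step may be sketched in Python as follows. The sketch assumes the indicator kernel, so that the degree of a point is the number of other points within distance $\varepsilon$, and it reads the threshold as $\omega_N N \varepsilon^D + \log N$. The function name `discard_low_degree`, the toy data and the value used for $\omega_N$ are hypothetical.

```python
import math


def discard_low_degree(points, eps, omega_n):
    """Return the indices of the points that survive the thresholding step,
    with degrees computed for the indicator kernel phi(s) = 1{s <= 1}."""
    n, dim = len(points), len(points[0])
    threshold = omega_n * n * eps ** dim + math.log(n)
    kept = []
    for i, x in enumerate(points):
        degree = sum(
            1 for j, y in enumerate(points)
            if j != i and math.dist(x, y) <= eps
        )
        if degree >= threshold:  # points with degree below the threshold are discarded
            kept.append(i)
    return kept


# A densely sampled short segment (50 points) plus 10 widely spaced outliers.
cluster = [(0.2 + 0.002 * i, 0.5) for i in range(50)]
outliers = [(0.05 + 0.1 * k, 0.1) for k in range(10)]
pts = cluster + outliers
kept = discard_low_degree(pts, eps=0.05, omega_n=2.0)
```

Here the outliers have no neighbor within `eps` and are discarded, while every cluster point has at least two dozen neighbors and is kept; the clustering steps are then run on the surviving points only.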

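Lemma 1. For $S \in \mathcal{S}_d(\kappa)$, $\varepsilon, \tau > 0$ and $x \in B(S, \tau)$,

$$\mathrm{vol}_D(B(S, \tau) \cap B(x, \varepsilon)) \asymp \gamma(S, \tau, \varepsilon),$$

where

$$\gamma(S, \tau, \varepsilon) = (\tau \wedge \varepsilon)^{D - d} \left((\tau \wedge \varepsilon) \vee (\mathrm{diam}(S) \wedge \varepsilon)\right)^d.$$

The proof of Lemma 1 is in Section B.1. Let $\gamma(S, \tau) = \tau^{D-d} (\tau \vee \mathrm{diam}(S))^d$; by Lemma 1, $\mathrm{vol}_D(B(S, \tau)) \asymp \gamma(S, \tau)$.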

Proposition 2. Assume the conditions of Theorem 1 hold, now in a setting that includes outliers. Suppose: 

$$N_k\, \frac{\gamma(S_k, \tau, \varepsilon)}{\gamma(S_k, \tau)} \gg \omega_N N \varepsilon^D + \log N, \quad \forall k = 1, \dots, K. \qquad (6)$$

Then, Algorithms 1 and 2 (modified) are both perfectly accurate with high probability. 

The proof of Proposition 2 is in Section 6.7. With enough separation as assumed here, outliers are 
disconnected from non-outliers, and their degree is of order roughly $N\varepsilon^D$. Therefore, they should be properly
identified by the thresholding procedure. As for non-outliers, the term on the left-hand side of (6) is the
order of magnitude of the degree of points sampled from $B(S_k, \tau)$, so that (6) essentially guarantees that
non-outliers survive the thresholding step. 

In [38] outliers are sampled anywhere in space but away from clusters, which corresponds to $\Psi_0$ having
support $[0,1]^D \setminus \bigcup_k B(S_k, \tau)$. In that case, perfect accuracy is impossible, as the algorithms will confuse
outliers within $\varepsilon$ from $B(S_k, \tau)$ with points belonging to $X_k$. However, knowing that, with high probability,
there are at most $O(N_0 \varepsilon)$ such outliers (in fact $O(N_0 \varepsilon^{D - d_k})$ if $\tau \preceq \varepsilon$, by Lemma 1), the algorithms make a
mistake on a negligible fraction of outliers. 

4.2 Clustering at the detection threshold 

Assume each cluster is sufficiently sampled, which we rigorously define as: 

$$N_k > (N^{d_k/D} \vee N \tau^{D - d_k}) \log(N), \quad \forall k = 1, \dots, K. \qquad (7)$$

Note that the related condition $N_k \asymp N^{d_k/D}$ is equivalent to requiring that, within each cluster, the distance
between a point and its nearest neighbor is of order $N^{-1/D}$. With (7) holding, the choice $\varepsilon = (\log N/N)^{1/D}$
implies both (5) and (6), so that Algorithms 1 and 2 (modified) are perfectly accurate with high probability,
even in a setting including outliers. 

Now, instead of clustering, consider the task of detecting the presence of a cluster hidden among a 
large number of outliers. We observe the data, $x_1, \dots, x_N$, and want to decide between the following two
hypotheses: under the null, the points are all outliers; under the alternative, there is a surface $S \in \mathcal{S}_d$ such
that $N_1$ points are sampled from $B(S, \tau)$, while the rest of the points, $N - N_1$ of them, are sampled as
outliers. Assuming that the parameters $d$ and $\tau$ are known, it is shown in [5, 3] that the scan statistic is
able to separate the null from the alternative if

$$N_1 > N^{d/D} \vee N \tau^{D - d}.$$

The author is not aware of a method that improves on those rates, and from translating recent results on
detection in graphs [4], there is evidence that those rates are optimal up to a poly-logarithmic factor. This
condition is essentially the same as (7), except for the $\log N$ factor. Hence, Algorithms 1 and 2 (modified)
solve the clustering task perfectly within a poly-logarithmic factor of the best known signal-to-noise ratio
required for the detection task.
5 Discussion



5.1 Selecting the scale parameter 

Choosing the affinity scale $\varepsilon$ is critical in all algorithms described here, and more generally in any method
which uses a neighborhood graph. Assuming (7) holds, we already saw that the choice $\varepsilon = (\log N/N)^{1/D}$
implies (5), so that, with enough separation, Algorithms 1, 2 and 3 are accurate. In terms of separation,
this allows the clusters to be as close as $(\log N/N)^{1/D}$. As seen in Section 3, this is not optimal. Though
the choice of $\varepsilon$ may be made more precise with more information on the clusters, like the number of points
sampled from them and their dimension, this information may not be available. 

In practice, choosing the scale is still an ongoing line of research, with similarities with bandwidth 
selection in kernel smoothing. We focus on the local scaling method of Zelnik-Manor and Perona [51], where
$a(x_i, x_j)$ is defined as $\phi(\|x_i - x_j\| / \sqrt{\varepsilon_i \varepsilon_j})$, with $\varepsilon_i$ equal to the distance between $x_i$ and its $\ell$th nearest
neighbor. When $\phi$ is of compact support, this essentially means that if $x_i$ is not among the first $\ell$ nearest
neighbors of $x_j$ and vice versa, $x_i$ and $x_j$ are not connected in the neighborhood graph, corresponding to
a mutual $k$-nearest neighbor graph. The parameter $\ell$ replaces $\varepsilon$ as the tuning parameter, effectively setting
the number of neighbors (degree) instead of the neighborhood range. This allows the scaling to adapt to the 
local sampling density. 
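
The local scaling construction may be illustrated with a short Python sketch. It assumes the indicator kernel with support $[0, 1]$, so that two points are linked when their distance is at most the geometric mean of their local scales. The function names `local_scales` and `local_scaling_graph` and the toy data are hypothetical, and the neighbor search is done by brute force.

```python
import math


def local_scales(points, ell):
    """Local scale of each point: the distance to its ell-th nearest
    neighbor, the point itself excluded."""
    scales = []
    for i, x in enumerate(points):
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)
        scales.append(dists[ell - 1])
    return scales


def local_scaling_graph(points, ell):
    """Adjacency lists of the graph linking i and j whenever
    ||x_i - x_j|| <= sqrt(scale_i * scale_j)."""
    scales = local_scales(points, ell)
    n = len(points)
    adj = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) <= math.sqrt(scales[i] * scales[j]):
                adj[i].append(j)
                adj[j].append(i)
    return scales, adj


# A densely sampled cluster and a sparsely sampled one, far apart.
dense = [(0.001 * i, 0.0) for i in range(5)]
sparse = [(0.5 + 0.05 * i, 0.0) for i in range(5)]
scales, adj = local_scaling_graph(dense + sparse, ell=2)
```

In this example the local scale is about fifty times larger in the sparse cluster than in the dense one, and no edge links the two clusters: a single value of `ell` serves both sampling densities, whereas a single global scale would have to be tuned to one of them.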

Proposition 3. Consider the generative model of Section 1.1 with surfaces $S_1, \dots, S_K \in \mathcal{S}(\kappa)$. In terms of
separation, for a sequence $\omega_N \to \infty$ such that $\omega_N = o(\log N)$, assume that

$$\delta - 2\tau \gg \max_{k = 1, \dots, K} \max\left\{ (\tau \vee \mathrm{diam}(S_k)) \left(\frac{\omega_N \log N}{N_k}\right)^{1/d_k},\ (\tau \vee \mathrm{diam}(S_k))^{d_k/D}\, \tau^{1 - d_k/D} \left(\frac{\omega_N \log N}{N_k}\right)^{1/D} \right\}.$$

Then, the local scaling version of Algorithm 1 with $\ell = \omega_N \log N$ is perfectly accurate with high probability.

The proof of Proposition 3 is in Section 6.8. As a consequence of Proposition 3, with local scaling, 
Algorithm 1 essentially achieves the separation in (5). So in that sense local scaling offers a (near-)optimal 
way of building the neighborhood graph. 

A weaker result, directly dealing with a $k$-nearest neighbor graph and without the optimality implications
on the amount of separation, appears in Brito, Chávez, Quiroz and Yukich [14]. They find that, when the
separation between clusters remains fixed, choosing $k$ of order $\log N$ makes Algorithm 1 work. However,
assuming the underlying surfaces have diameter of order 1 and same dimension $d$, Maier, Hein and von
Luxburg [38] find that the optimal $k$ is of order $N\delta^d$, which is of order $\log N$ only when $\delta$ is of order
$(\log(N)/N)^{1/d}$. As they point out in their paper, it makes sense to use a larger $k$ if the separation between
clusters is large. However, it is still not clear how to automatically choose an optimal k without information 
on the separation between clusters. 



5.2 Selecting the number of clusters 

Algorithm 2 depends on choosing the number of clusters K appropriately. Since the method relies on the 
few top eigenvectors of the matrix Z, a first approach consists in choosing K by inspecting the eigenvalues 
of $Z$. We provide below an estimate for the gap between $\lambda_K(Z)$ and $\lambda_{K+1}(Z)$, which in theory may be
used to select the correct number of clusters. Note that the bound we derive is very crude; for example, if
the surfaces are affine subspaces and the sampling is exact ($\tau = 0$), a sharper bound of order $\varepsilon^2$ holds [13].

Proposition 4. Under the conditions of Theorem 2, $\lambda_K(Z) - \lambda_{K+1}(Z) \succeq N^{-2}$ with high probability.

The proof of Proposition 4 is in Section 6.9. In practice, this method is seen to work poorly; for example, 
in [36], choosing the number of clusters by cross-validation is observed to be more reliable. In [51], the authors 
suggest examining the few top eigenvectors instead of the eigenvalues. We do not study these methods here. 
In a slightly different context, Biau, Cadre and Pelletier [10] propose essentially to count the number of 
connected components found by Algorithm 1. The conditions stated in Theorem 1 of course guarantee this 
estimate is accurate with high probability. Their result is however more precise. 
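
For illustration only, the eigenvalue-based approach may be sketched in Python with NumPy as below. The rule used here, picking the $K$ that maximizes the gap between consecutive eigenvalues of $Z$, is one simple way of inspecting the eigenvalues and is an assumption of the sketch; as noted above, such rules are reported to work poorly in practice. The function name `estimate_num_clusters` and the toy affinity matrices are hypothetical.

```python
import numpy as np


def estimate_num_clusters(W, max_k=None):
    """Return the K maximizing lambda_K(Z) - lambda_{K+1}(Z), where
    Z = D^{-1/2} W D^{-1/2} and the eigenvalues are sorted in decreasing order."""
    W = np.asarray(W, dtype=float)
    deg = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(deg)
    Z = W * inv_sqrt[:, None] * inv_sqrt[None, :]
    eigvals = np.sort(np.linalg.eigvalsh(Z))[::-1]  # decreasing order
    if max_k is None:
        max_k = len(eigvals) - 1
    gaps = eigvals[:max_k] - eigvals[1:max_k + 1]
    return int(np.argmax(gaps)) + 1


# Affinity matrix of two disconnected groups of three mutually linked points
# (zero diagonal, as in the definition of the affinity).
clique = np.ones((3, 3)) - np.eye(3)
W = np.block([[clique, np.zeros((3, 3))], [np.zeros((3, 3)), clique]])
K_hat = estimate_num_clusters(W)
```

In this idealized case $Z$ has eigenvalue 1 with multiplicity two and all other eigenvalues equal to $-1/2$, so the largest gap sits after the second eigenvalue.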
5.3 When the ambient dimension is large

In a number of modern applications, such as clustering of gene expression data [48, 8], document retrieval [33, 
7] or clustering 3D objects in computer vision [29], the ambient dimension D is routinely several orders of 
magnitude larger than the number of points N. Though we can always restrict ourselves to the subspace 
where the points live, which is of dimension N — 1 or less, we consider here the situation where the ambient 
space is the unit ball in an infinite-dimensional space, for example a Hilbert space as considered in [11]. As 
defining a uniform distribution in such a space is a non-trivial endeavor [25], we modify the model slightly. 
We assume that the points are generated from the surface $S$ as follows: $x_i = y_i + \tau z_i$, where $y_i \sim \Psi_S$,
a probability measure equivalent to the uniform measure on $S$, and $z_i \sim \Psi_B$, a probability measure with
support in the unit ball. Outliers are directly sampled from $\Psi_B$.

Under this setting, Theorems 1 and 2 remain valid in the case where $\tau \preceq \varepsilon$, where the condition (5) does
not involve the ambient dimension:

$$\varepsilon \gg \max_{k = 1, \dots, K} \mathrm{diam}(S_k) \left(\log(N_k)/N_k\right)^{1/d_k}.$$

The arguments are essentially identical. The case $\varepsilon \preceq \tau$ is not as straightforward, since this is the regime
where, in some sense, the effective dimension of $B(S_k, \tau)$ is the ambient dimension, and the specifics of the
distribution $\Psi_B$ come into play. Also, our arguments involve using packings of $B(S_k, \tau)$, so that the actual
structure of the ambient space is critical. The same comments apply for the case where outliers are present 
in the data. 

5.4 Computational Issues 

We consider the computational complexity of each of the methods described earlier in the paper. Below, (3 
is a large enough constant. 

Building the neighborhood graph may be done by brute force in $O(pN^2)$ flops, where $p$ is the cost of 
computing the distance between two points; for example, without further structure, $p \asymp D$ in dimension $D$. 
This may be done more efficiently using an algorithm for range search, or $\ell$-nearest neighbor search for the 
local scaling version. In low dimensions, $D \prec \log\log N$, this may be done with kd-trees in $O(N \log(N)^\beta)$ 
flops. In higher dimensions, other alternatives may work better [15]. 

Once the neighborhood graph is built, Algorithm 1 extracts the connected components of the graph, 
which may be done in $O(\omega_N N \log N)$ flops if using the local scaling version with $\ell = \omega_N \log N$ as suggested 
in Section 5.1, since in that case the maximum degree is not larger than $\ell$. Algorithm 2 extracts the 
leading eigenvectors of $Z$, which may be done in $O(K N \log(N)^\beta)$ flops, using Lanczos-type algorithms [18] 
since, using again local scaling with $\ell = \omega_N \log N$, $Z$ has about $\omega_N \log N$ non-zero coefficients per row. 
So in both Algorithms 1 and 2 with local scaling, the total computational complexity is $O(N \log(N)^\beta)$ 
flops in low dimensions; and at most $O(pN^2 + N \log(N)^\beta)$ flops in higher dimensions. Algorithm 3 runs in 
$O(pN^2 \log(N)^\beta)$ flops in any dimension [52]. 
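
As an illustration of the brute-force route, the following is a minimal sketch of the connected-components step, assuming a plain $\varepsilon$-neighborhood graph in which two points are joined when their Euclidean distance is at most `eps`. The function name `connected_components`, the union-find bookkeeping and the toy data are illustrative choices, not taken from the paper, and the kd-tree and local scaling variants are not covered.

```python
import math
from itertools import combinations


def connected_components(points, eps):
    """Label points by connected component of the eps-neighborhood graph."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Path halving keeps the union-find trees shallow.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute force over all pairs: of order N^2 distance evaluations.
    for i, j in combinations(range(n), 2):
        if math.dist(points[i], points[j]) <= eps:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri

    # Relabel roots as 0, 1, 2, ... in order of first appearance.
    labels, seen = [], {}
    for i in range(n):
        root = find(i)
        labels.append(seen.setdefault(root, len(seen)))
    return labels


# Hypothetical toy data: two well-separated groups on a line in the plane.
pts = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (1.0, 0.0), (1.1, 0.0)]
labels = connected_components(pts, eps=0.15)
```

With `eps` smaller than the gap between the two groups, the points split into two components; with `eps` larger than that gap they merge into one.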

6 Proofs 

In all the proofs that follow, we assume for concreteness that the different sampling distributions are in fact 
uniform distributions over their respective support and that the kernel $\phi$ has support [0,1]; we assume that 
$\tau > 0$ and that all underlying surfaces are of diameter of order 1. The remaining cases may be treated 
similarly. We use $C$ to denote a generic positive constant, whose actual value may change from place to 
place. 

6.1 Proof of Theorem 1 

First, two distinct clusters $X_k$ and $X_\ell$ are disjoint in the graph. Indeed, for $i \in I_k$ and $j \in I_\ell$, 
$\|x_i - x_j\| > \delta - 2\tau > \varepsilon$ and therefore $a(x_i, x_j) = 0$ since $\phi$ is supported in [0,1]. 

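Now consider a single cluster of size $N$ generated from a surface $S \in \mathcal{S}_d$. Let $y_1, \dots, y_n$ be an $\varepsilon/5$-packing 
of $B(S,\tau)$. Because $B(S,\tau) \subset \bigcup_j B(S,\tau) \cap B(y_j, \varepsilon/5)$, we have 
$\mathrm{vol}_D(B(S,\tau)) \le \sum_j \mathrm{vol}_D(B(S,\tau) \cap B(y_j, \varepsilon/5))$, so 
that by Lemma 1, $\tau^{D-d} \prec n \varepsilon^d (\varepsilon \wedge \tau)^{D-d}$; on the other hand, 
$\bigsqcup_j B(S,\tau) \cap B(y_j, \varepsilon/10) \subset B(S,\tau)$ implies 
$\sum_j \mathrm{vol}_D(B(S,\tau) \cap B(y_j, \varepsilon/10)) \le \mathrm{vol}_D(B(S,\tau))$, so that by Lemma 1 again, 
$n \varepsilon^d (\varepsilon \wedge \tau)^{D-d} \prec \tau^{D-d}$. Hence, 
$n \asymp (\varepsilon \vee \tau)^{D-d} \varepsilon^{-D}$. 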

By condition (5), $N \gg n \log n$, and so by a simple modification of Lemma 2 in [6], with high probability 
each ball $B(y_j, \varepsilon/5)$ contains at least one (in fact, order $N/n$) data point. If $X$ were disconnected, we 
could group the data points into two groups in such a way that the minimum distance between the two 
groups would exceed $\varepsilon$. By the triangle inequality, this would imply a grouping of the $y_j$'s into two groups 
with a minimum pairwise distance of $(3/5)\varepsilon$. The balls $B(y_j, \varepsilon/5)$, $j = 1, \dots, n$, would then be divided into 
two disjoint groups, which contradicts the fact that they cover the connected set $S$. 

6.2 Proof of Theorem 2 

We follow the strategy outlined in [40] based on verifying the following conditions (where (A4) has been 
simplified). For $k = 1, \dots, K$, let $W_k$ denote the submatrix of $W$ corresponding to the index set $I_k$. For 
$i \in I_k$, define $\hat D_i = \sum_{j \in I_k} W_{ij}$, which is the degree of $x_i$ within the cluster it belongs to. Let $v_1, \dots, v_N$ 
denote the row vectors of $V$. 

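(A1) For all $k = 1, \dots, K$, the second largest eigenvalue of $\hat Z_k = \hat D_k^{-1/2} W_k \hat D_k^{-1/2}$, where 
$\hat D_k = \mathrm{diag}\{W_k \cdot 1\}$, is bounded above by $1 - \zeta$. 

(A2) For all $k, \ell = 1, \dots, K$, with $k \neq \ell$, 

\[ \sum_{i \in I_k} \sum_{j \in I_\ell} \frac{W_{ij}^2}{\hat D_i \hat D_j} \le \nu_1. \]

(A3) For all $k = 1, \dots, K$ and all $i \in I_k$, 

\[ \frac{1}{\hat D_i} \sum_{j \notin I_k} W_{ij} \le \nu_2 \Big( \sum_{s,t \in I_k} \frac{W_{st}^2}{\hat D_s \hat D_t} \Big)^{-1/2}. \]

(A4) For all $k = 1, \dots, K$ and all $i, j \in I_k$, $\hat D_i \le \theta \hat D_j$. 

We present below a slightly modified version of Theorem 2 in [40]. 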

Theorem 7 (Based on Th. 2 in [40]). Under (A1)-(A4), there is an orthonormal set $\{r_1, \dots, r_K\} \subset \mathbb{R}^K$ 
such that 

\[ \sum_{k=1}^K \sum_{i \in I_k} \|v_i - r_k\|^2 \le 16\, \theta\, \zeta^{-2} (K^2 \nu_1 + K \nu_2^2) N. \]

The proof of Theorem 7 is in Section A. It is partly based on information that Andrew Ng shared with 
the author and the proof of [16, Th. 4.5] by Chen and Lerman. Note that the latter deals with the special 
case where the clusters are of comparable sizes ($N_k \asymp N$) and of same dimension ($d_k = d$), and the result 
they obtain is somewhat different. 

We show below that $\zeta \succ N^{-2}$, that $\nu_1, \nu_2 \prec N^{-q}$ for any $q > 0$, and that $\theta \prec 1$. Hence, the right hand side 
in the expression above is of order $N^{-q}$ for any $q > 0$, since $K$ is assumed fixed. Hence, $\max_i \min_k \|v_i - r_k\| \to 0$ 
and therefore, since the $r_k$'s are themselves orthonormal, $K$-means with near-orthogonal initialization 
outputs the perfect clustering with high probability. We now turn to verifying (A1)-(A4), in reverse order. 

(A4): We show that, with high probability and uniformly over $i, j \in I_k$, $\hat D_i \asymp \hat D_j$. 

Consider a single cluster of size $N$, generated from sampling near a surface $S$ of dimension $d$. Assume that 
(5) holds, namely $N\varepsilon^D/(\varepsilon \vee \tau)^{D-d} \gg \log N$. Given $x_i$, the variables $a(x_i, x_j)$, $j \neq i$, are i.i.d. random variables in [0,1], 
with mean $\xi_i \asymp \Psi_S(B(x_i, \varepsilon))$. Note that $\xi_i \asymp \varepsilon^D/(\varepsilon \vee \tau)^{D-d}$ by Lemma 1. Using Hoeffding's Inequality, in 
the form of inequality (2.1) in [30], there is a constant $C > 0$ such that 

\[ P\big( |D_i - N\xi_i| > (1/2) N\xi_i \big) \le 2\exp(-C N \xi_i). \]

By (5), $N\xi_i \gg \log N$ so that, with Boole's Inequality, we conclude that $D_i \asymp N\xi_i$ uniformly over all $i$, with 
high probability. 

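(A3): Fix $k = 1, \dots, K$; by applying the result in (A4) with $\phi^2$ as a kernel, we get the following order 
of magnitude, uniformly over $s \in I_k$, 

\[ \sum_{t \in I_k} W_{st}^2 \asymp N_k \varepsilon^D/(\varepsilon \vee \tau)^{D-d_k}. \]

Therefore, by (A4) and then (5), 

\[ \sum_{s,t \in I_k} \frac{W_{st}^2}{\hat D_s \hat D_t} \asymp (\varepsilon \vee \tau)^{D-d_k} \varepsilon^{-D} \ll N. \]

Now, take two points, $x_i \in X_k$ and $x_j \notin X_k$. Because $\|x_i - x_j\| > \delta - 2\tau > \omega_N \varepsilon$, we get $W_{ij} \le \phi(\omega_N)$. With 
(A4) and (5), this implies that 

\[ \frac{1}{\hat D_i} \sum_{j \notin I_k} W_{ij} \prec \frac{N}{N_k} (\varepsilon \vee \tau)^{D-d_k} \varepsilon^{-D} \phi(\omega_N) \ll N \phi(\omega_N). \]

Therefore, we can take $\nu_2 = N^{3/2} \phi(\omega_N)$, so that $\nu_2 \prec N^{-q}$ for any $q > 0$. 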

(A2): We apply the same arguments we just used to bound the sum on the left hand side of (A3). In 
particular, we can take $\nu_1 = N^2 \phi(\omega_N)^2$, so that $\nu_1 \prec N^{-q}$ for any $q > 0$. 

(A1): We prove that the spectral gap $\zeta$ satisfies $\zeta \succ N^{-2}$ with high probability. As suggested in [40], we 
approach this through a lower bound on the Cheeger constant. Consider a single cluster of size $N$, generated 
from sampling near a surface of dimension $d$. Assume that (5) holds, namely $N\varepsilon^D/(\varepsilon \vee \tau)^{D-d} \gg \log N$. That 
$Z$ has eigenvalue 1 with multiplicity 1 results from the graph being connected. The Cheeger constant 
of $W$ is defined as: 

\[ h = \min_{|I| \le N/2} \frac{\sum_{i \in I} \sum_{t \notin I} W_{it}}{\sum_{i \in I} D_i}, \]

where the minimum is over all non-empty subsets $I \subset \{1, \dots, N\}$ of size $|I| \le N/2$. The spectral gap of $Z$ is then of 
order at least $h^2$. Using (A4), we get the lower bound: 

\[ h \succ \big( N\varepsilon^D/(\varepsilon \vee \tau)^{D-d} \big)^{-1} \min_{|I| \le N/2} \frac{\sum_{i \in I} \sum_{t \notin I} W_{it}}{|I|}. \]

Let $y_1, \dots, y_n$ be an $\varepsilon/5$-packing of $B(S,\tau)$ and define $A_j = B(S,\tau) \cap B(y_j, \varepsilon/5)$. Not only are the 
cells $A_1, \dots, A_n$ non-empty with high probability, actually they all contain order $N\varepsilon^D/(\varepsilon \vee \tau)^{D-d}$ points; see 
Lemma 2 in [6] or the proof of (A4). For a fixed $I$ such that $|I| \le N/2$, there are necessarily two cells $A_j$ 
and $A_\ell$ with $\|y_j - y_\ell\| \le \varepsilon/2$ such that $\#(A_j \cap \{x_i : i \in I\}) \ge \#A_j/2$ and $\#(A_\ell \cap \{x_i : i \notin I\}) \ge \#A_\ell/2$, 
for otherwise it would imply that $S$ is disconnected. Therefore, given that points in $A_j$ and $A_\ell$ are within 
distance $\varepsilon$, and that both $A_j$ and $A_\ell$ contain order $N\varepsilon^D/(\varepsilon \vee \tau)^{D-d}$ points, we have 

\[ \sum_{i \in I} \sum_{t \notin I} W_{it} \succ N\varepsilon^D/(\varepsilon \vee \tau)^{D-d}. \]

As this is true for any $I$ such that $|I| \le N/2$, we have that $h \succ N^{-1}$. 

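
To make the definition of the Cheeger constant concrete, here is a small brute-force sketch that evaluates the ratio of cut weight to total degree over all subsets of size at most $N/2$. It is exponential in $N$ and only meant for tiny examples; the function name `cheeger_constant` and the path-graph affinity matrix are illustrative assumptions, and self-affinities are set to zero here, which may differ from the convention used for $W$ in the paper.

```python
from itertools import combinations


def cheeger_constant(W):
    """Brute-force Cheeger constant of a symmetric affinity matrix W.

    Minimises cut(I, complement) / (sum of degrees in I) over all
    non-empty subsets I with |I| <= N/2. Only usable for tiny graphs.
    """
    n = len(W)
    degree = [sum(row) for row in W]
    best = float("inf")
    for size in range(1, n // 2 + 1):
        for subset in combinations(range(n), size):
            inside = set(subset)
            cut = sum(W[i][j] for i in inside for j in range(n) if j not in inside)
            volume = sum(degree[i] for i in inside)
            if volume > 0:
                best = min(best, cut / volume)
    return best


# Hypothetical example: path graph 0 - 1 - 2 - 3 with unit weights.
W = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
h = cheeger_constant(W)
```

On this path graph the minimum is attained by cutting the middle edge from one end, for instance $I = \{0, 1\}$, which has cut weight 1 and total degree 3.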
6.3 Proof of Theorem 4 

We start with the one-dimensional case, which is substantially simpler than the situation in higher dimensions, 
as the boundaries of one-dimensional sets are just points. We work with the supnorm for convenience and 
clarity. For two probability distributions $P, Q$, let $H(P,Q)$ denote their Hellinger distance [34, Chap. 13]. 

6.3.1 The case d = 1 



Consider the line segment within $[0,1]^D$ generated by the first canonical vector, which we identify with [0,1]. 
For $f \in (0,1)$ and $\delta > 0$, define 

\[ S_1 = \{u \in [0,1] : u < f - \delta/2\}; \]

\[ S_2 = \{u \in [0,1] : u > f + \delta/2\}. \]

For $\tau \in (0, \delta/2)$, generate a cluster $X_1$ (resp. $X_2$) by sampling uniformly from $B(S_1,\tau)$ (resp. $B(S_2,\tau)$), 
where 

\[ B(S_1,\tau) = \{(u,a) \in [0,1] \times [0,\tau]^{D-1} : u < f - \delta/2 + \tau\}; \]

\[ B(S_2,\tau) = \{(u,a) \in [0,1] \times [0,\tau]^{D-1} : u > f + \delta/2 + \tau\}. \]

The sampling is in proportion with the volume of these regions, i.e. $N_j \propto \mathrm{vol}_D(B(S_j,\tau))$. By sufficiency, we 
need only consider the first coordinate, effectively reducing the case to that of $\tau = 0$. From this perspective, 
the setting is that of points sampled from $P_{f,\delta}$, the uniform distribution on $[0,1] \setminus [f - \delta/2, f + \delta/2]$. 

Let $f_0 = 1/4$ and $f_1 = 3/4$, and assume $\delta < 1/4$. Suppose we want to decide between $P^{\otimes N}_{f_0,\delta}$ and $P^{\otimes N}_{f_1,\delta}$. 
From a clustering method, we obtain a test in the following way: after grouping the points, we reject the 
null hypothesis if $f_1$ separates the two clusters. Since the interval [3/8, 5/8] contains more than $N/5$ data 
points with high probability, the clustering method has an error rate of at least 1/5 when as a test it makes 
an error. Fix a probability $p \in (0,1)$. As a consequence of [34, Th. 13.1.3], and 

\[ N H^2(P_{f_0,\delta}, P_{f_1,\delta}) = N\delta/(2 - 2\delta) \le N\delta, \]

any test makes an error with probability at least $p$ if $N\delta$ is small enough. 

6.3.2 The case d ≥ 2 

Consider the $d$-dimensional affine surface within $[0,1]^D$ generated by the first $d$ canonical vectors, which we 
identify with $[0,1]^d$. For a function $f : [0,1]^{d-1} \to [0,1]$ and $\delta > 0$, define 

\[ S_1 = \{(u,v) \in [0,1]^{d-1} \times [0,1] : v < f(u) - \delta/2\}; \]

\[ S_2 = \{(u,v) \in [0,1]^{d-1} \times [0,1] : v > f(u) + \delta/2\}. \]

For $\tau \in (0, \delta/2)$, generate a cluster $X_1$ (resp. $X_2$) by sampling uniformly from $B(S_1,\tau)$ (resp. $B(S_2,\tau)$), 
where 

\[ B(S_1,\tau) = \{(u,v,a) \in [0,1]^{d-1} \times [0,1] \times [0,\tau]^{D-d} : v < f(u) - \delta/2 + \tau\}; \]

\[ B(S_2,\tau) = \{(u,v,a) \in [0,1]^{d-1} \times [0,1] \times [0,\tau]^{D-d} : v > f(u) + \delta/2 + \tau\}. \]

Again, the sampling is in proportion with the volume of these regions. By sufficiency, we need only consider 
the first $d$ coordinates, effectively reducing the case to that of $\tau = 0$. Henceforth, the setting is that of points 
sampled from $P_{f,\delta}$, the uniform distribution on $[0,1]^d \setminus R_{f,\delta}$, where 

\[ R_{f,\delta} = \{(u,v) \in [0,1]^{d-1} \times [0,1] : f(u) - \delta/2 \le v \le f(u) + \delta/2\}. \]

For $A > 0$, consider the function $f_0(u) = 1/2 + \delta$ if $u \in [0, A\delta]^{d-1}$ and $f_0(u) = 1/2$ otherwise. For $A$ large 
enough relative to $\kappa$, we have $S_1, S_2 \in \mathcal{S}_d(\kappa)$. Let $f_1 = 1 - f_0$ and consider testing $P^{\otimes N}_{f_0,\delta}$ versus $P^{\otimes N}_{f_1,\delta}$. From a 
clustering method, we obtain a test in the same way: after grouping the points, we reject the null hypothesis 
if the graph of $f_1$ separates the two clusters. The region $[0, A\delta]^{d-1} \times [1/2 - \delta/2, 1/2 + \delta/2]$ is non-empty with 
non-negligible probability if $N\delta^d$ is bounded away from zero. When this happens, the clustering method 
'misclassifies' the points falling in that region when as a test it makes an error. Fix a probability $p \in (0,1)$. 
As a consequence of [34, Th. 13.1.3], and 

\[ N H^2(P_{f_0,\delta}, P_{f_1,\delta}) = A^{d-1} N\delta^d/(2 - 2\delta) \prec N\delta^d, \]

any test makes an error with probability at least $p$ if $N\delta^d$ is small enough. 

6.4 Proof of Theorem 5 

In dimension one, we show that the logarithmic factor is needed. This seems quite intuitive, since the longest 
distance between any pair of consecutive points is of order $\log(N)/N$ [2]. We build on the proof of Theorem 
4. Define $m = \delta^{-1}$, assumed to be an integer for simplicity. Consider $f_j = j\delta$ for $1 \le j \le m$, and define 
$J_0 = \{j : (1/8)m < j < (3/8)m\}$ and $J_1 = \{j : (5/8)m < j < (7/8)m\}$. Suppose we want to decide between 
$P^{\otimes N}_{f_j,\delta}$, $j \in J_0$, and $P^{\otimes N}_{f_j,\delta}$, $j \in J_1$. With $\delta < 1/8$, the clustering method has an error rate of at least 1/9 when 
as a test it makes an error. 
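
The order-$\log(N)/N$ size of the largest spacing can be checked numerically. The following is a rough simulation sketch, assuming points drawn uniformly on [0,1] with the endpoints 0 and 1 added before taking gaps; the function name `max_spacing`, the sample size and the seed are arbitrary illustrative choices.

```python
import math
import random


def max_spacing(n, rng):
    """Largest gap between consecutive points among n uniform draws on
    [0, 1], with the endpoints 0 and 1 included."""
    xs = sorted(rng.random() for _ in range(n))
    grid = [0.0] + xs + [1.0]
    return max(b - a for a, b in zip(grid, grid[1:]))


rng = random.Random(0)  # fixed seed, purely for reproducibility
n = 10_000
gap = max_spacing(n, rng)
ratio = gap / (math.log(n) / n)  # should be of order 1
```

The ratio of the largest gap to $\log(N)/N$ stays within a constant factor of 1 in such runs, which is the scale at which a gap of width $\delta$ planted in the sample becomes hard to tell apart from a naturally occurring one.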

Lemma 2. Consider testing $P^{\otimes N}_{f_j,\delta}$, $j \in J_0$ versus $P^{\otimes N}_{f_j,\delta}$, $j \in J_1$. If $\delta \le C \log(N)/N$ with $C < 1$, then for 
any test, the sum of the probabilities of type I and type II errors tends to 1. 

The proof of Lemma 2 is in Section B.2. 

6.5 Proof of Theorem 6 

We build on the proof of Theorem 4. Define $m = (A\delta)^{-1}$, assumed to be an integer for simplicity. For 
a sequence $\pi_j \in \{-1, 1\}$, $j = 1, \dots, m$, consider the function $f_{0,\pi}(u) = 1/4 + (\pi_1 + \cdots + \pi_j)\delta$ if $u \in 
[(j-1)A\delta, jA\delta)$, for $1 \le j \le m$. Similarly define $f_{1,\pi}$ by replacing 1/4 with 3/4. Suppose we want to 
decide between $P^{\otimes N}_{f_{0,\pi},\delta}$, $\pi \in \{-1,1\}^m$ and $P^{\otimes N}_{f_{1,\pi},\delta}$, $\pi \in \{-1,1\}^m$. With $A$ large enough and $\delta$ small, we have 
$1/A + \delta/2 < 1/8$ and the clustering method has an error rate of at least 1/9 when as a test it makes an error. 

Lemma 3. Consider testing $P^{\otimes N}_{f_{0,\pi},\delta}$, $\pi \in \{-1,1\}^m$ versus $P^{\otimes N}_{f_{1,\pi},\delta}$, $\pi \in \{-1,1\}^m$. If 

\[ \delta \ll N^{-1/2} \log(N)^{-1} \log\log(N)^{-1/2}, \]

then for any test, the sum of the probabilities of type I and type II errors tends to 1. 

The proof of Lemma 3 is in Section B.3. 

6.6 Proof of Proposition 1 

It is enough to show that, with high probability, no two outliers are within distance $\varepsilon$. So consider $x_1, \dots, x_{N_0}$ 
outliers sampled according to $\Psi_0$. Let $D_i$ be the number of data points in $B(x_i, \varepsilon)$ other than $x_i$ itself. 
Then, $P(D_i > 0) = 1 - (1 - \Psi_0(B(x_i, \varepsilon)))^{N_0 - 1} \le 1 - \exp(-C N_0 \varepsilon^D)$, for some constant $C > 0$, since 
$\Psi_0(B(x_i, \varepsilon)) \asymp \varepsilon^D$ uniformly in $i$ and $\varepsilon$ is small. Hence, using Boole's inequality we have 

\[ P(\exists i : D_i > 0) \le N_0 \big( 1 - \exp(-C N_0 \varepsilon^D) \big) \le 2C \varepsilon^D N_0^2 = o(1). \]

6.7 Proof of Proposition 2 

We need to prove that with high probability, all outliers have degree bounded above by $\omega_N N \varepsilon^D + \log N$ and 
that all non-outliers have degree exceeding that threshold. In both cases, the arguments are similar to those 
used in obtaining (A4) in the proof of Theorem 2. 

For the first part, consider the case of $N$ outliers. Given $x_i$, the variables $a(x_i, x_j)$, $j \neq i$, are i.i.d. random variables 
in [0,1], with mean $\xi_i \asymp \Psi_0(B(x_i, \varepsilon))$. By Hoeffding's Inequality in the form of inequality (2.1) in [30], we 
have 

\[ P(D_i > \omega_N N \xi_i + \log N) \le \exp(-\log(\omega_N) \log(N)). \]

We conclude using this, the fact that $\xi_i \asymp \varepsilon^D$, and Boole's Inequality. 

For the latter, consider a single cluster of size $N$, generated from sampling near a surface $S$ of dimension $d$. 
Assume that (6) holds, namely $N\varepsilon^D/(\varepsilon \vee \tau)^{D-d} \gg \omega_N N \varepsilon^D + \log N$. In proving (A4) in the proof of Theorem 
2, we found that, with high probability, $D_i \asymp N\xi_i \asymp N\varepsilon^D/(\varepsilon \vee \tau)^{D-d}$ uniformly over all $i$. Therefore, with 
high probability, $D_i > \omega_N N \varepsilon^D + \log N$ for all $i$. 

6.8 Proof of Proposition 3 

It is enough to show that, for each $k = 1, \dots, K$, the distance from $x_i$ to its $\ell$th nearest neighbor within 
$B(S_k, \tau)$, denoted $\varepsilon_i^*$, satisfies 

\[ \varepsilon_i^* \asymp (\omega_N \log(N)/N_k)^{1/d_k} \vee \tau^{1 - d_k/D} (\omega_N \log(N)/N_k)^{1/D}. \]

Indeed, under the assumed separation, this will show that $\varepsilon_i = \varepsilon_i^*$, which in turn implies that the clusters 
are disjoint in the neighborhood graph; and also, that each cluster is connected in the neighborhood graph, 
since (5) is satisfied (at $k$) with $\varepsilon_i^*$ in place of $\varepsilon$ and therefore the arguments for Theorem 1 apply directly. 

By (A4) in the proof of Theorem 2, we see that, uniformly over $x_i \in B(S_k, \tau)$, the number of points 
sampled from $B(S_k, \tau) \cap B(x_i, \varepsilon)$ is of order of magnitude $N_k \varepsilon^D/(\tau \vee \varepsilon)^{D-d_k}$ for $\varepsilon$ satisfying (5), that is 

\[ \varepsilon \gg (\log(N_k)/N_k)^{1/d_k} \vee \tau^{1 - d_k/D} (\log(N_k)/N_k)^{1/D}. \]

In setting $N_k (\varepsilon_i^*)^D/(\tau \vee \varepsilon_i^*)^{D-d_k} \asymp \ell$, with $\ell = \omega_N \log N$, we find $\varepsilon_i^*$ of the right order of magnitude. 
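
The local scale discussed above is simply the distance from each point to its $\ell$th nearest neighbor. A minimal brute-force sketch follows, assuming Euclidean distances and excluding the point itself from its own neighbor list; the function name `local_scales` and the one-dimensional toy data are illustrative, and how the resulting scales enter the graph construction is not reproduced here.

```python
import math


def local_scales(points, ell):
    """Distance from each point to its ell-th nearest neighbor (brute force)."""
    scales = []
    for i, x in enumerate(points):
        # Distances to all other points, smallest first.
        dists = sorted(math.dist(x, y) for j, y in enumerate(points) if j != i)
        scales.append(dists[ell - 1])
    return scales


# Hypothetical one-dimensional example with unevenly spaced points.
pts = [(0.0,), (1.0,), (3.0,), (7.0,)]
eps = local_scales(pts, ell=2)
```

Points in sparser regions receive a larger scale (here the isolated point at 7 gets scale 6), which is the mechanism that lets the scale adapt to clusters of different densities.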



6.9 Proof of Proposition 4 

We use the notation introduced in the proof of Theorems 2 and 7. By Proposition 5, $\hat Z$ has eigenvalue 1 
with multiplicity $K$ and an eigengap bounded below by $\zeta$. We also know that $\|Z - \hat Z\|_F \prec \sqrt{\nu_1 + \nu_2^2}$ by 
Proposition 6 (holding $K$ fixed), assuming (A2)-(A3) are satisfied. We then compare the spectrum of $Z$ and 
$\hat Z$ using [45, Th. IV.4.8]: 

\[ \sum_{m=1}^N |\lambda_m(Z) - \lambda_m(\hat Z)|^2 \le \|Z - \hat Z\|_F^2. \]

This implies that 

\[ \lambda_K(Z) - \lambda_{K+1}(Z) \ge \lambda_K(\hat Z) - \lambda_{K+1}(\hat Z) - 2\|Z - \hat Z\|_F \ge \zeta - 2\sqrt{\nu_1 + \nu_2^2}. \]

We then conclude using the orders of magnitude, $\zeta \succ N^{-2}$ and $\nu_1, \nu_2 \prec N^{-q}$, for any $q > 0$, holding with 
high probability, obtained when verifying (A1)-(A3). 



A Proof of Theorem 7 

The strategy outlined in the paper of Ng, Jordan and Weiss [40] consists in first analyzing the case of infinite 
separation ($\delta = \infty$) and then treating the case of finite separation as a perturbation of the case of infinite 
separation; (A1)-(A4) are used to control the amount of perturbation. We emphasize that Andrew Ng 
shared with the author the arguments underlying the proof of Theorem 7. Inspired by the proof of Chen 
and Lerman [16, Th. 4.5], we present below a slightly different proof, made more concise thanks to a recent 
result by Zwald and Blanchard [53, Th. 3]. 

A.1 The case of infinite separation 

Let $\hat W$ denote the similarity matrix in this situation, so that $\hat W_{ij} = W_{ij}$ if there is $k$ such that $i, j \in I_k$, 
and $\hat W_{ij} = 0$ otherwise. Let $\hat D, \hat Z, \hat U, \hat V$ denote the matrices obtained from $\hat W$ following Algorithm 2. Let 
$\hat u_1, \dots, \hat u_N$ and $\hat v_1, \dots, \hat v_N$ denote the row vectors of $\hat U$ and $\hat V$ respectively. 

Proposition 5. Assuming (A1) holds, the matrix $\hat Z$ has top eigenvalue 1 with multiplicity $K$ and eigengap 
bounded below by $\zeta$. Moreover, there is a set of orthonormal vectors $r_1, \dots, r_K \in \mathbb{R}^K$ such that, for $i \in I_k$, 

\[ \hat u_i = a_{ik} r_k, \quad \text{with } a_{ik} = \hat D_i^{1/2} \Big/ \Big( \sum_{j \in I_k} \hat D_j \Big)^{1/2}; \quad \text{and therefore } \hat v_i = r_k. \]

The proof of Proposition 5 is in Section B.4. 

A.2 The case of finite separation 

For a subspace $A$, let $\mathrm{Proj}_A$ denote the orthogonal projection onto $A$. Also, let $\|\cdot\|_F$ denote the (matrix) 
Frobenius norm. Let $u_1, \dots, u_N$ and $v_1, \dots, v_N$ denote the row vectors of $U$ and $V$ respectively. 

Lemma 4. Let $A$ and $B$ be two linear subspaces in $\mathbb{R}^n$ of same dimension $p$. For every orthonormal basis 
$\{a_1, \dots, a_p\}$ of $A$, there is an orthonormal basis $\{b_1, \dots, b_p\}$ of $B$ such that 

\[ \sum_{j=1}^p \|a_j - b_j\|^2 \le \|\mathrm{Proj}_A - \mathrm{Proj}_B\|_F^2. \]

The proof of Lemma 4 is in Section B.5. 

Let $L$ (resp. $\hat L$) denote the subspace spanned by the top $K$ eigenvectors of $Z$ (resp. $\hat Z$). By Lemma 4, 
given an orthonormal basis $\hat U$ for $\hat L$ (in matrix form), there is an orthonormal basis $U$ for $L$ such that 

\[ \|U - \hat U\|_F^2 \le \|\mathrm{Proj}_L - \mathrm{Proj}_{\hat L}\|_F^2. \]

Note that $\|U - \hat U\|_F^2 = \sum_{i=1}^N \|u_i - \hat u_i\|^2$. Now, by the triangle inequality, 

\[ \|v_i - \hat v_i\| \le \|u_i\| \, \big| \|u_i\|^{-1} - \|\hat u_i\|^{-1} \big| + \|u_i - \hat u_i\| \, \|\hat u_i\|^{-1} \le 2 \|u_i - \hat u_i\| \, \|\hat u_i\|^{-1}. \]

By Proposition 5 and (A4), $\|\hat u_i\|^{-2} \le \theta N_k \le \theta N$, so that 

\[ \sum_{i=1}^N \|v_i - \hat v_i\|^2 \le 4\theta N \sum_{i=1}^N \|u_i - \hat u_i\|^2. \]

By Proposition 5, there is a set of orthonormal vectors $r_1, \dots, r_K \in \mathbb{R}^K$ such that, for $i \in I_k$, $\hat v_i = r_k$, and 
therefore $V$ satisfies 

\[ \sum_{k=1}^K \sum_{i \in I_k} \|v_i - r_k\|^2 \le 4\theta N \|\mathrm{Proj}_L - \mathrm{Proj}_{\hat L}\|_F^2. \]

As in [16], we use [53, Th. 3], together with Proposition 5 to obtain the following bound 

\[ \|\mathrm{Proj}_L - \mathrm{Proj}_{\hat L}\|_F \le 2 \|Z - \hat Z\|_F / \zeta. \]

We then conclude with the following bound on the amount of perturbation. 

Proposition 6. Assume (A2)-(A3) are satisfied. Then, $\|Z - \hat Z\|_F \le \sqrt{K^2 \nu_1 + K \nu_2^2}$. 

The proof of Proposition 6 is in Section B.6. 



B Proofs of Auxiliary Results 
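B.1 Proof of Lemma 1 

Let $y \in S$ be such that $\|x - y\| \le \tau$. First, consider the case $\tau > \varepsilon/2$. The upper bound is obvious. For the 
lower bound, consider $z = (1 - \lambda)x + \lambda y$ with $\lambda = \varepsilon/(4\tau)$, so that $\|z - x\| \le \varepsilon/4$ and $\|z - y\| \le \tau - \varepsilon/4$, 
and therefore $B(z, \varepsilon/8) \subset B(S,\tau) \cap B(x, \varepsilon)$, by the triangle inequality, with $\mathrm{vol}_D(B(z, \varepsilon/8)) \asymp \varepsilon^D$. Next, assume 
$\tau \le \varepsilon/2$. Note that $B(y, \varepsilon/2) \subset B(x, \varepsilon) \subset B(y, (3/2)\varepsilon)$, so we focus on proving Lemma 1 for $y$ in place of $x$. 
Let $y_1, \dots, y_n$ be a $\tau$-packing of $S \cap B(y, \varepsilon - \tau)$. We have $n \asymp (\varepsilon/\tau)^d$; indeed, 

\[ S \cap B(y, \varepsilon/2) \subset \bigcup_j S \cap B(y_j, \tau) \ \Rightarrow\ \varepsilon^d \asymp \mathrm{vol}_d(S \cap B(y, \varepsilon/2)) \le \sum_j \mathrm{vol}_d(S \cap B(y_j, \tau)) \asymp n\tau^d; \]

\[ S \cap B(y, \varepsilon) \supset \bigsqcup_j S \cap B(y_j, \tau/2) \ \Rightarrow\ \varepsilon^d \asymp \mathrm{vol}_d(S \cap B(y, \varepsilon)) \ge \sum_j \mathrm{vol}_d(S \cap B(y_j, \tau/2)) \asymp n\tau^d. \]

In a similar fashion, using the triangle inequality, 

\[ B(S,\tau) \cap B(y, \varepsilon/2) \subset \bigcup_j B(y_j, 2\tau) \ \Rightarrow\ \mathrm{vol}_D(B(S,\tau) \cap B(y, \varepsilon/2)) \le \sum_j \mathrm{vol}_D(B(y_j, 2\tau)) \asymp n\tau^D; \]

\[ B(S,\tau) \cap B(y, \varepsilon) \supset \bigsqcup_j B(y_j, \tau/2) \ \Rightarrow\ \mathrm{vol}_D(B(S,\tau) \cap B(y, \varepsilon)) \ge \sum_j \mathrm{vol}_D(B(y_j, \tau/2)) \asymp n\tau^D. \]

Therefore, $\mathrm{vol}_D(B(S,\tau) \cap B(x, \varepsilon)) \asymp n\tau^D$, which together with $n \asymp (\varepsilon/\tau)^d$ implies Lemma 1. This concludes 
the proof of Lemma 1. 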

B.2 Proof of Lemma 2 

Define $\zeta_j$ as equal to the number of points in $(f_j - \delta/2, f_j + \delta/2)$. Let $m' = |J_1| = |J_0|$, and note that 
$m' \asymp m/4$. We put uniform priors on both the null and the alternative. The resulting likelihoods (with 
respect to the uniform measure) under the null and alternative are 

\[ Z_s = (m')^{-1} (1 - 1/m)^{-N} \cdot \#\{j \in J_s : \zeta_j = 0\}, \quad s = 0, 1. \]

Note that $Z_0$ and $Z_1$ have the same distribution under the uniform measure. It is enough to show that the 
Hellinger affinity, here $E(\sqrt{Z_0 Z_1})$, tends to one [34, Th. 13.1.2]; we do that by showing that $\liminf E(\sqrt{Z_0 Z_1}) \ge 1$. 
Since $E(\sqrt{Z_0 Z_1}) \ge E(\sqrt{Z_0 Z_1} \cdot 1\{Z_0, Z_1 \le 2\})$, by dominated convergence it is enough to show that 
$Z_1, Z_0 \to 1$ in probability. For that, we prove that $\mathrm{var}(Z_0) = o(1)$, which is enough since $E(Z_0) = 1$. From 

\[ \mathrm{var}(1\{\zeta_j = 0\}) \le (1 - 1/m)^N, \quad \mathrm{cov}(1\{\zeta_j = 0\}, 1\{\zeta_k = 0\}) = (1 - 2/m)^N - (1 - 1/m)^{2N} \le 0, \]

we have 

\[ \mathrm{var}(Z_0) \le (m')^{-2} (1 - 1/m)^{-2N} \cdot \big( m' (1 - 1/m)^N \big) \prec (1/m) \exp(N/m). \]

Therefore, $\mathrm{var}(Z_0) = o(1)$ when $m \ge C^{-1} N/\log(N)$, i.e. $\delta \le C \log(N)/N$, where $C < 1$. 

B.3 Proof of Lemma 3 

Define $\zeta_{0,\pi,j}$ as equal to the number of points in $[(j-1)A\delta, jA\delta) \times (f_{0,\pi,j} - \delta/2, f_{0,\pi,j} + \delta/2)$, where $f_{0,\pi,j} = 
1/4 + (\pi_1 + \cdots + \pi_j)\delta$; also, let $\zeta_{0,\pi} = \sum_j \zeta_{0,\pi,j}$. We define $\zeta_{1,\pi,j}$ and $\zeta_{1,\pi}$ similarly, with 1/4 replaced by 3/4. 
We see $\pi$ as the increments of a nearest-neighbor path in $\mathbb{Z}$. As such, we put the same prior on the null 
and the alternative, defined by the random path with low-predictability profile used in [4, Sec. 2.2.4] and 
inspired from [27, 9]. The resulting likelihoods (with respect to the uniform measure) under the null and 
alternative are 

\[ Z_s = (1 - 1/(Am))^{-N} \cdot P(\zeta_{s,\pi} = 0 \mid x_1, \dots, x_N), \quad s = 0, 1. \]

Following the proof of Lemma 2, we show that $\mathrm{var}(Z_0) = o(1)$. We have 

\[ E(Z_0^2) = (1 - 1/(Am))^{-2N} E\big( P(\zeta_{0,\pi} = \zeta_{0,\pi'} = 0 \mid \pi, \pi') \big), \]

where $\pi, \pi'$ are two independent copies distributed according to the prescribed prior on paths. Let $T = |\pi \cap \pi'|$; 
because 

\[ P(\zeta_{0,\pi} = \zeta_{0,\pi'} = 0 \mid \pi, \pi') = (1 - |\pi \cup \pi'|/(Am^2))^N, \quad \text{with } |\pi \cup \pi'| = 2m - T, \]

we have 

\[ E(Z_0^2) = (1 - 1/(Am))^{-2N} E\big( (1 - (2m - T)/(Am^2))^N \big) \le e^{2N/(Am)^2} E\big( e^{T N/(Am^2)} \big). \]

Following the arguments in [4, Sec. 2.2.4], we get 

\[ E\big( e^{T N/(Am^2)} \big) \le \exp\big( C \log(m)^2 \log\log(m) N/m^2 \big). \]

Therefore, $E(Z_0^2) \to 1$ when $\log(m)^2 \log\log(m) N/m^2 = o(1)$, which happens when 

\[ \delta \ll N^{-1/2} \log(N)^{-1} \log\log(N)^{-1/2}. \]

B.4 Proof of Proposition 5 

Without loss of generality, we assume that the points are indexed so that $I_k = \{N_1 + \cdots + N_{k-1} + 1, \dots, N_1 + 
\cdots + N_{k-1} + N_k\}$. With this ordering, $\hat W$ is block diagonal, $\hat W = \mathrm{diag}\{W_1, \dots, W_K\}$, and so is $\hat Z = 
\mathrm{diag}\{\hat Z_1, \dots, \hat Z_K\}$, with $\hat Z_k = \hat D_k^{-1/2} W_k \hat D_k^{-1/2}$ and $\hat D_k = \mathrm{diag}\{W_k \cdot 1\}$. 

Each $\hat Z_k$ has the same spectrum as $\hat D_k^{-1} W_k$, which is the transition matrix of the Markov chain associated 
with the graph with affinity matrix $W_k$. Indeed, $v$ is an eigenvector for $\hat D_k^{-1} W_k$ if, and only if, $\hat D_k^{1/2} v$ is 
an eigenvector for $\hat Z_k$. Therefore, the largest eigenvalue of $\hat Z_k$ is 1, and because of (A1), that eigenvalue is 
simple. Hence, the largest eigenvalue of $\hat Z$ is 1 and $\hat Z$ has that eigenvalue with multiplicity $K$, which is the 
number of blocks. Moreover, the eigengap for $\hat Z$ is the minimum among the eigengaps of the $\hat Z_k$, $k = 1, \dots, K$, 
which are all bounded below by $\zeta$ when (A1) holds. 

Concerning eigenvectors, $a_k = \hat D_k^{1/2} 1 / \|\hat D_k^{1/2}\|_F$ is a top (normalized) eigenvector for $\hat Z_k$, since $\hat D_k^{-1} W_k \cdot 
1 = 1$. Let $b_k = (0, a_k, 0)^T \in \mathbb{R}^N$, where the first (resp. second) 0 is of size $N_1 + \cdots + N_{k-1}$ (resp. 
$N_{k+1} + \cdots + N_K$). Any system of $K$ orthonormal eigenvectors of $\hat Z$ for the eigenvalue 1 is a rotation of 
$B = [b_1, \dots, b_K]$. Therefore, there is an orthogonal matrix $R = [r_1, \dots, r_K]$ such that $\hat U = B R^T$. That is, 
for $i \in I_k$, the row vectors of $\hat U$ satisfy $\hat u_i = a_{ik} r_k$. After normalization, we get $\hat v_i = r_k$ for $i \in I_k$. 
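
The claim that $\hat D_k^{1/2} 1$, once normalized, is an eigenvector of $\hat Z_k$ for the eigenvalue 1 can be checked numerically on a single block. The sketch below does this in plain Python for a small affinity matrix; the matrix entries and the helper name `normalized_affinity` are hypothetical and serve only as an illustration.

```python
import math


def normalized_affinity(W):
    """Return Z = D^{-1/2} W D^{-1/2} and the degrees D_i = sum_j W_ij."""
    degree = [sum(row) for row in W]
    n = len(W)
    Z = [[W[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
         for i in range(n)]
    return Z, degree


# One connected block with symmetric affinities (hypothetical numbers).
W = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]
Z, degree = normalized_affinity(W)

# Candidate top eigenvector a = D^{1/2} 1 / ||D^{1/2} 1||.
norm = math.sqrt(sum(degree))
a = [math.sqrt(d) / norm for d in degree]

# Z a should reproduce a, i.e. a is an eigenvector for the eigenvalue 1.
Za = [sum(Z[i][j] * a[j] for j in range(3)) for i in range(3)]
residual = max(abs(x - y) for x, y in zip(Za, a))
```

The residual is zero up to rounding, in line with the identity $(\hat Z_k a_k)_i = \hat D_i^{-1/2} \sum_j W_{ij} / \|\hat D_k^{1/2} 1\| = (a_k)_i$.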

B.5 Proof of Lemma 4 

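Let $\theta_1, \dots, \theta_p$ denote the principal angles between these two subspaces [26]. It is well-known that 

\[ \|\mathrm{Proj}_A - \mathrm{Proj}_B\|_F^2 = 2 \sum_{j=1}^p \sin^2 \theta_j. \]

(The quantity $\|\mathrm{Proj}_A - \mathrm{Proj}_B\|_F$ is called in [22] the projection F-norm distance between the subspaces $A$ 
and $B$.) On the other hand, by definition of the principal angles, there are orthonormal bases, $\{a_1^0, \dots, a_p^0\}$ 
for $A$ and $\{b_1^0, \dots, b_p^0\}$ for $B$, such that $\cos \theta_j = (a_j^0)^T b_j^0$. Therefore, since $1 - \cos\theta_j \le \sin^2\theta_j$ for $\theta_j \in [0, \pi/2]$, 

\[ \sum_{j=1}^p \|a_j^0 - b_j^0\|^2 = 2 \sum_{j=1}^p (1 - \cos\theta_j) \le 2 \sum_{j=1}^p \sin^2 \theta_j. \]

We then simply choose an orthogonal matrix $R$ such that $a_j = R a_j^0$ and define $b_j = R b_j^0$. 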

B.6 Proof of Proposition 6 

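This comes from directly summing over blocks. Indeed, 

\[ \|Z - \hat Z\|_F^2 = \sum_k \sum_{i,j \in I_k} (Z_{ij} - \hat Z_{ij})^2 + \sum_{k \neq \ell} \sum_{i \in I_k} \sum_{j \in I_\ell} Z_{ij}^2. \tag{8} \]

Note that for all $i \in I_k$, $D_i - \hat D_i = \sum_{j \notin I_k} W_{ij}$ so that $D_i - \hat D_i \ge 0$ and by (A3) we have 

\[ \frac{D_i}{\hat D_i} \le 1 + \nu_2 C_k^{-1/2}, \qquad C_k := \sum_{s,t \in I_k} \frac{W_{st}^2}{\hat D_s \hat D_t}. \tag{9} \]

For $i, j \in I_k$, using (9) we get 

\[ (Z_{ij} - \hat Z_{ij})^2 = \frac{W_{ij}^2}{\hat D_i \hat D_j} \Big( 1 - \sqrt{\frac{\hat D_i \hat D_j}{D_i D_j}} \Big)^2 \le \nu_2^2 C_k^{-1} \frac{W_{ij}^2}{\hat D_i \hat D_j}. \]

Summing over $i, j \in I_k$ gives at most $\nu_2^2$ for each $k$. Hence, the first term on the right hand side of (8) is bounded from above by $K \nu_2^2$. 

For $i \in I_k$ and $j \in I_\ell$, $Z_{ij}^2 = W_{ij}^2/(D_i D_j) \le W_{ij}^2/(\hat D_i \hat D_j)$, since $\hat D_i \le D_i$ for all $i$. Hence, by (A2), the 
second term on the right hand side of (8) is bounded from above by $K^2 \nu_1$. 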